Importing the necessary libraries

library(dplyr)
library(tidyverse)
library(ggplot2)

Reading the data

data <- read.csv("/home/riya/BRN/gapminder_clean.csv") %>%
  as_tibble()
head(data)
## # A tibble: 6 × 20
##       X Country.Name  Year Agriculture..value.added....…¹ CO2.emissions..metri…²
##   <int> <chr>        <int>                          <dbl>                  <dbl>
## 1     0 Afghanistan   1962                             NA                 0.0738
## 2     1 Afghanistan   1967                             NA                 0.124 
## 3     2 Afghanistan   1972                             NA                 0.131 
## 4     3 Afghanistan   1977                             NA                 0.183 
## 5     4 Afghanistan   1982                             NA                 0.166 
## 6     5 Afghanistan   1987                             NA                 0.276 
## # ℹ abbreviated names: ¹​Agriculture..value.added....of.GDP.,
## #   ²​CO2.emissions..metric.tons.per.capita.
## # ℹ 15 more variables:
## #   Domestic.credit.provided.by.financial.sector....of.GDP. <dbl>,
## #   Electric.power.consumption..kWh.per.capita. <dbl>,
## #   Energy.use..kg.of.oil.equivalent.per.capita. <dbl>,
## #   Exports.of.goods.and.services....of.GDP. <dbl>, …

1. Filter the data to include only rows where Year is 1962 and
a) make a scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap for the filtered data
b) calculate the correlation of ‘CO2 emissions (metric tons per capita)’ and gdpPercap. What is the correlation and associated p value?

#filtering the data to include rows where Year is equal to 1962
filtered_data1<-data %>%
  filter(Year==1962)
filtered_data1 %>%
  ggplot(aes(x=CO2.emissions..metric.tons.per.capita.,y=gdpPercap))+
  geom_point()

# ggsave("scatterPlot.png",path="/home/riya/BRN/Plots")
cor_res<- cor.test(filtered_data1$CO2.emissions..metric.tons.per.capita.,filtered_data1$gdpPercap)
cor_res
## 
##  Pearson's product-moment correlation
## 
## data:  filtered_data1$CO2.emissions..metric.tons.per.capita. and filtered_data1$gdpPercap
## t = 25.269, df = 106, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8934697 0.9489792
## sample estimates:
##       cor 
## 0.9260817
cor_res$p.value
## [1] 1.128679e-46

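The correlation coefficient of approximately 0.926 indicates a very strong positive linear relationship between “CO2 emissions (metric tons per capita)” and “GDP per capita” in 1962. The associated p-value is about 1.13e-46, so the correlation is significantly different from zero. The 95% confidence interval (0.893 to 0.949) is narrow and lies well above zero, which further supports this conclusion.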
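
The individual pieces of a `cor.test()` result can be pulled out by name, which is convenient when reporting the estimate, p-value and interval together. A minimal sketch, assuming R's built-in `mtcars` data as a stand-in (it is not the gapminder file, and the variable names `r_est`, `p_val` and `ci` are only illustrative):

```r
# Illustrative only: built-in mtcars data, not gapminder_clean.csv
res <- cor.test(mtcars$wt, mtcars$mpg)

r_est <- unname(res$estimate)  # Pearson correlation coefficient
p_val <- res$p.value           # two-sided p-value
ci    <- res$conf.int          # 95% confidence interval for the correlation

# One-line summary suitable for a report
cat(sprintf("r = %.3f, p = %.3g, 95%% CI [%.3f, %.3f]\n",
            r_est, p_val, ci[1], ci[2]))
```

The same component names (`estimate`, `p.value`, `conf.int`) apply to the `cor_res` object above.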

2. On the unfiltered data, answer “In what year is the correlation between ‘CO2 emissions (metric tons per capita)’ and gdpPercap the strongest?” Filter the dataset to that year for the next step…

data %>%
  filter(complete.cases(CO2.emissions..metric.tons.per.capita.,gdpPercap))%>%
  group_by(Year)%>%
  summarise(cor=cor(CO2.emissions..metric.tons.per.capita.,gdpPercap))%>%
  slice(which.max(cor))
## # A tibble: 1 × 2
##    Year   cor
##   <int> <dbl>
## 1  1967 0.939
filtered_data2 <- data%>%
  filter(Year==1967)
head(filtered_data2)
## # A tibble: 6 × 20
##       X Country.Name    Year Agriculture..value.added..…¹ CO2.emissions..metri…²
##   <int> <chr>          <int>                        <dbl>                  <dbl>
## 1     1 Afghanistan     1967                         NA                    0.124
## 2    11 Albania         1967                         NA                    1.36 
## 3    21 Algeria         1967                         10.3                  0.632
## 4    31 American Samoa  1967                         NA                   NA    
## 5    41 Andorra         1967                         NA                   NA    
## 6    51 Angola          1967                         NA                    0.167
## # ℹ abbreviated names: ¹​Agriculture..value.added....of.GDP.,
## #   ²​CO2.emissions..metric.tons.per.capita.
## # ℹ 15 more variables:
## #   Domestic.credit.provided.by.financial.sector....of.GDP. <dbl>,
## #   Electric.power.consumption..kWh.per.capita. <dbl>,
## #   Energy.use..kg.of.oil.equivalent.per.capita. <dbl>,
## #   Exports.of.goods.and.services....of.GDP. <dbl>, …
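
Note that `which.max(cor)` selects the largest positive correlation. That matches "strongest" here because the yearly correlations are positive, but if strong negative correlations were possible, ranking on the absolute value would be the safer reading. A small sketch on made-up data (the `toy` tibble and its columns `x` and `y` are hypothetical):

```r
library(dplyr)

# Hypothetical data: year 2000 has r = 0.8, year 2005 has r = -1
toy <- tibble(
  Year = rep(c(2000, 2005), each = 4),
  x    = c(1, 2, 3, 4,  1, 2, 3, 4),
  y    = c(1, 3, 2, 4,  8, 6, 4, 2)
)

strongest <- toy %>%
  group_by(Year) %>%
  summarise(cor = cor(x, y)) %>%
  slice_max(abs(cor), n = 1)   # rank by magnitude, not by sign

strongest
```

With these values, ranking by `abs(cor)` picks 2005, whereas `which.max(cor)` would have picked 2000.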

3. Using plotly, create an interactive scatter plot comparing ‘CO2 emissions (metric tons per capita)’ and gdpPercap, where the point size is determined by pop (population) and the color is determined by the continent

library(plotly)
p <- filtered_data2 %>%
  ggplot(aes(x=CO2.emissions..metric.tons.per.capita.,y=gdpPercap,color=continent))+
  geom_point(aes(size=pop))

ggplotly(p)
#ggsave("emVsgdp_scatterplot.png",plot=p,path ="/home/riya/BRN/Plots" )

4. What is the relationship between continent and ‘Energy use (kg of oil equivalent per capita)’?

# plotting a boxplot to visualise the relationship between these variables
data %>%
  ggplot(aes(x=continent,y=Energy.use..kg.of.oil.equivalent.per.capita.))+
  geom_boxplot()

 # ggsave("boxplot.png",path="/home/riya/BRN/Plots")

From the plot above, there seem to be differences in energy use across continents, particularly for Asia, Europe and Oceania (the highest median is observed for Oceania). We test whether these differences are statistically significant using a one-way ANOVA.

aov_model <- aov(data$Energy.use..kg.of.oil.equivalent.per.capita. ~ data$continent)
summary(aov_model) 
##                  Df    Sum Sq   Mean Sq F value Pr(>F)    
## data$continent    5 8.124e+08 162482656   21.88 <2e-16 ***
## Residuals      1404 1.043e+10   7426183                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1197 observations deleted due to missingness

Here, the observed p-value is very small (<2e-16) and provides strong evidence to reject the null hypothesis that mean energy use is the same on every continent. This indicates statistically significant differences in energy use across the continents.
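
The ANOVA says that at least one continent differs, but not which pairs differ. One common follow-up, which is not part of the analysis above, is Tukey's HSD post-hoc test. A minimal sketch, assuming the built-in `PlantGrowth` data as a stand-in for the continent and energy-use columns:

```r
# Illustrative only: built-in PlantGrowth data, not gapminder_clean.csv
fit <- aov(weight ~ group, data = PlantGrowth)

# Pairwise group differences with family-wise 95% confidence intervals
tukey <- TukeyHSD(fit)
print(tukey)

# Matrix with one row per pair and columns diff, lwr, upr, p adj
pairs_tbl <- tukey$group
```

`TukeyHSD()` needs the grouping variable named in a formula with a `data` argument (or at least a factor term), so for the gapminder data the model would likely be refit as `aov(Energy.use..kg.of.oil.equivalent.per.capita. ~ continent, data = data)` first.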

5. Is there a significant difference between Europe and Asia with respect to ‘Imports of goods and services (% of GDP)’ in the years after 1990?

# density plot to visualise the differences in imports of goods and services in two continents
data %>%
  filter(Year>1990 & continent %in% c('Europe','Asia'))%>%
  ggplot(aes(x=Imports.of.goods.and.services....of.GDP.,fill=continent))+
  geom_density(alpha=0.3)+
  labs(title="Imports of goods and services between Europe and Asia")

# stats
my_Data <- data %>%
  filter(Year>1990)%>%
  select(continent,Imports.of.goods.and.services....of.GDP.)%>%
  filter(continent %in% c('Europe','Asia'))
t.test(Imports.of.goods.and.services....of.GDP. ~ continent,my_Data)
## 
##  Welch Two Sample t-test
## 
## data:  Imports.of.goods.and.services....of.GDP. by continent
## t = 1.3552, df = 137.53, p-value = 0.1776
## alternative hypothesis: true difference in means between group Asia and group Europe is not equal to 0
## 95 percent confidence interval:
##  -2.321099 12.433240
## sample estimates:
##   mean in group Asia mean in group Europe 
##             46.84531             41.78924

The p-value of 0.1776 is greater than the conventional significance level of 0.05, so we fail to reject the null hypothesis. The data therefore provide no evidence of a significant difference in imports of goods and services (% of GDP) between Asia and Europe in the years after 1990.

6. What is the country (or countries) that has the highest ‘Population density (people per sq. km of land area)’ across all years?

data%>%
  group_by(Country.Name)%>%
  summarise(mean=mean(Population.density..people.per.sq..km.of.land.area.))%>%
  slice(which.max(mean))
## # A tibble: 1 × 2
##   Country.Name       mean
##   <chr>             <dbl>
## 1 Macao SAR, China 14732.

Macao SAR, China has the highest mean ‘Population density (people per sq. km of land area)’ across all years, at about 14,732 people per sq. km.
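
One caveat about the code above: `mean()` returns `NA` for any country with a missing value, and `which.max()` then skips those countries. Passing `na.rm = TRUE` averages over the years that are available instead. A small sketch on made-up data (the `toy` tibble and its `density` column are hypothetical):

```r
library(dplyr)

# Hypothetical data: country A has one missing year
toy <- tibble(
  Country.Name = c("A", "A", "B", "B"),
  density      = c(100, NA, 50, 60)
)

res <- toy %>%
  group_by(Country.Name) %>%
  summarise(
    mean_strict = mean(density),               # NA if any year is missing
    mean_na_rm  = mean(density, na.rm = TRUE)  # ignores missing years
  )

res
```

Here country A has the higher density once missing years are ignored, but it would be dropped from the ranking under the strict mean. Whether that changes the answer for the gapminder data has not been checked here.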

7. What country (or countries) has shown the greatest increase in ‘Life expectancy at birth, total (years)’ between 1962 and 2007?

data %>%
  filter(Year %in% c(1962,2007)) %>%
  select(Year,Country.Name,Life.expectancy.at.birth..total..years.)%>%
  group_by(Country.Name)%>%
  pivot_wider(names_from = Year,values_from = Life.expectancy.at.birth..total..years.)%>%
  mutate(diff_LE=`2007`-`1962`)%>%
  arrange(desc(diff_LE))
## # A tibble: 263 × 4
## # Groups:   Country.Name [263]
##    Country.Name       `1962` `2007` diff_LE
##    <chr>               <dbl>  <dbl>   <dbl>
##  1 Maldives             38.5   75.4    36.9
##  2 Bhutan               33.1   66.3    33.2
##  3 Timor-Leste          34.7   65.8    31.1
##  4 Tunisia              43.3   74.2    30.9
##  5 Oman                 44.3   75.1    30.8
##  6 Nepal                36.0   66.6    30.6
##  7 China                44.4   74.3    29.9
##  8 Yemen, Rep.          34.7   62.0    27.2
##  9 Saudi Arabia         46.7   73.3    26.7
## 10 Iran, Islamic Rep.   46.1   72.7    26.6
## # ℹ 253 more rows

Maldives saw the greatest increase in life expectancy at birth between 1962 and 2007: about 36.9 years, from 38.5 to 75.4.
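
Since the question allows for more than one country, the top row or rows can also be extracted directly rather than read off a sorted table. A minimal sketch using `dplyr::slice_max()`, which keeps ties by default, on made-up data (the `toy` tibble is hypothetical):

```r
library(dplyr)

# Hypothetical data: B and C tie for the largest increase
toy <- tibble(
  Country.Name = c("A", "B", "C"),
  diff_LE      = c(30, 36.9, 36.9)
)

top <- toy %>%
  slice_max(diff_LE, n = 1, with_ties = TRUE)  # returns every tied row

top
```

If the data frame is still grouped, as it is after `group_by(Country.Name)` above, `slice_max()` would work per group, so an `ungroup()` call would be needed first.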